Abstract


A growing number of state-of-the-art trans- 
fer learning methods employ language mod- 
els pretrained on large generic corpora. In this 
paper we present a conceptually simple and 
effective transfer learning approach that ad- 
dresses the problem of catastrophic forgetting. 
Specifically, we combine the task-specific op- 
timization function with an auxiliary language 
model objective, which is adjusted during the 
training process. This preserves language reg- 
ularities captured by language models, while 
enabling sufficient adaptation for solving the 
target task. Our method does not require pre- 
training or finetuning separate components of 
the network, and we train our models end-to-end
in a single step. We present results on a variety
of challenging affective and text classification
tasks, surpassing well-established transfer learning
methods of greater complexity.


1 Introduction 


Pretrained word representations captured by Lan- 
guage Models (LMs) have recently become pop- 
ular in Natural Language Processing (NLP). Pre- 
trained LMs encode contextual information and 
high-level features of language, modeling syntax 
and semantics, producing state-of-the-art results 
across a wide range of tasks, such as named entity 
recognition (Peters et al., 2017), machine transla- 
tion (Ramachandran et al., 2017) and text classifi- 
cation (Howard and Ruder, 2018). 

However, in cases where contextual embed- 
dings from language models are used as additional 
features (e.g. ELMo (Peters et al., 2018)), results 
come at a high computational cost and require 
task-specific architectures. At the same time, ap- 
proaches that rely on fine-tuning a LM to the task 
at hand (e.g. ULMFiT (Howard and Ruder, 2018)) 
depend on pretraining the model on an exten- 
sive vocabulary and on employing a sophisticated 


slanted triangular learning rate scheme to adapt the 
parameters of the LM to the target dataset. 

We propose a simple and effective transfer
learning approach that leverages LM contextual
representations and does not require any elaborate 
scheduling schemes during training. We initially 
train a LM on a Twitter corpus and then transfer 
its weights. We add a task-specific recurrent layer 
and a classification layer. The transferred model 
is trained end-to-end using an auxiliary LM loss, 
which allows us to explicitly control the weighting 
of the pretrained part of the model and ensure that 
the distilled knowledge it encodes is preserved. 

Our contributions are summarized as follows: 
1) We show that transfer learning from language 
models can achieve competitive results, while also 
being intuitively simple and computationally ef- 
fective. 2) We address the problem of catastrophic 
forgetting, by adding an auxiliary LM objective 
and using an unfreezing method. 3) Our results 
show that our approach is competitive with more 
sophisticated transfer learning methods. We make
our code publicly available.


2 Related Work 


Unsupervised pretraining has played a key role in 
deep neural networks, building on the premise that 
representations learned for one task can be use- 
ful for another task. In NLP, pretrained word vec- 
tors (Mikolov et al., 2013; Pennington et al., 2014) 
are widely used, improving performance in vari- 
ous downstream tasks, such as part-of-speech tag- 
ging (Collobert et al., 2011) and question answer- 
ing (Xiong et al., 2016). These pretrained word 
vectors serve as initialization of the embedding 
layer and remain frozen during training, while our 
pretrained language model also initializes the hid- 
den layers of the model and is fine-tuned to each 
classification task.


Aiming to learn from unlabeled data, Dai and 
Le (2015) use unsupervised objectives such as
sequence autoencoding and language modeling as
pretraining methods. The pretrained model is
then fine-tuned to the target task. However, the 
fine-tuning procedure of the language model to the 
target task does not include an auxiliary objective. 
Ramachandran et al. (2017) also pretrain encoder- 
decoder pairs using language models and fine-tune 
them to a specific task, using an auxiliary lan- 
guage modeling objective to prevent catastrophic 
forgetting. This approach, nevertheless, is only 
evaluated on machine translation tasks; moreover, 
the seq2seq (Sutskever et al., 2014) and language 
modeling losses are weighted equally throughout 
training. By contrast, we propose a weighted sum 
of losses, where the language modeling contribu- 
tion gradually decreases. ELMo embeddings (Pe- 
ters et al., 2018) are obtained from language mod- 
els and improve the results in a variety of tasks 
as additional contextual representations. However, 
ELMo embeddings rely on character-level models, 
whereas our approach uses a word-level LM. They 
are, furthermore, concatenated to pretrained word 
vectors and remain fixed during training. We in- 
stead propose a fine-tuning procedure, aiming to 
adjust a generic architecture to different end tasks. 


Moreover, BERT (Devlin et al., 2018) pretrains 
language models and fine-tunes them on the tar- 
get task. An auxiliary task (next sentence predic- 
tion) is used to enhance the representations of the 
LM. BERT fine-tunes masked bi-directional LMs. 
By contrast, our approach is limited to a uni-directional
model. Training BERT requires vast computa- 
tional resources, while our model only requires 1 
GPU. We note that our approach is not orthogo- 
nal to BERT and could be used to improve it, by 
adding an auxiliary LM objective and weighting its
contribution. 


Along the same lines, ULMFiT (Howard
and Ruder, 2018) shows impressive results on a 
variety of tasks by employing pretrained LMs. 
The proposed pipeline requires three distinct steps, 
that include (1) pretraining the LM, (2) fine-tuning 
it on a target dataset with an elaborate schedul- 
ing procedure and (3) transferring it to a classifica- 
tion model. Our proposed model is closely related 
to ULMFiT. However, ULMFiT trains a LM and 
fine-tunes it to the target dataset, before transfer- 
ring it to a classification model. While fine-tuning 


the LM to the target dataset, the metric (e.g. ac- 
curacy) that we intend to optimize cannot be ob- 
served. We propose adopting a multi-task learning 
perspective, via the addition of an auxiliary LM 
loss to the transferred model, to control the loss 
of the pretrained and the new task simultaneously. 
The intuition is that we should avoid catastrophic 
forgetting, but at the same time allow the LM to 
distill the knowledge of the prior data distribution 
and keep the most useful features. 

Multi-Task Learning (MTL) via hard parame- 
ter sharing (Caruana, 1993) in neural networks 
has proven to be effective in many NLP prob- 
lems (Collobert and Weston, 2008). More re- 
cently, alternative approaches have been suggested 
that only share parameters across lower layers (So- 
gaard and Goldberg, 2016). By introducing part- 
of-speech tags at the lower levels of the network, 
the proposed model achieves competitive results 
on chunking and CCG super tagging. Our auxil- 
iary language model objective follows this line of 
thought and intends to boost the performance of 
the higher classification layer. 


3 Our Model 


We introduce SiATL, which stands for Single-step 
Auxiliary loss Transfer Learning. In our proposed 
approach, we first train a LM. We then transfer its 
weights and add a task-specific recurrent layer to 
the final classifier. We also employ an auxiliary 
LM loss to avoid catastrophic forgetting. 


LM Pretraining. We train a word-level language 
model, which consists of an embedding LSTM 
layer (Hochreiter and Schmidhuber, 1997), 2 hidden
LSTM layers and a linear layer. We want to
minimize the negative log-likelihood of the LM:


$L_{LM} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T^n} \log p(x_t^n \mid x_1^n, \ldots, x_{t-1}^n)$ (1)


where $p(x_t^n \mid x_1^n, \ldots, x_{t-1}^n)$ is the distribution of the
t-th word in the n-th sentence given the t − 1 words
preceding it, $T^n$ is the length of the n-th sentence
and N is the total number of sentences.
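
The negative log-likelihood objective described above can be sketched in a few lines of plain Python. This is only an illustration on toy numbers, not the training code: the function name `lm_negative_log_likelihood` is hypothetical, the probabilities are made up, and normalizing by the number of sentences N is an assumption.

```python
import math


def lm_negative_log_likelihood(sentence_token_probs):
    """Negative log-likelihood of a corpus, averaged over sentences.

    sentence_token_probs: one list per sentence, holding the probability
    the LM assigned to each observed token given the tokens before it.
    """
    n_sentences = len(sentence_token_probs)
    total = 0.0
    for probs in sentence_token_probs:
        # Sum of -log p(x_t | x_1, ..., x_{t-1}) over the sentence.
        total += -sum(math.log(p) for p in probs)
    return total / n_sentences  # assumed normalization by N


# Toy example: two sentences with made-up token probabilities.
toy_probs = [[0.5, 0.25], [0.125]]
loss = lm_negative_log_likelihood(toy_probs)
```

A model that assigns higher probability to the observed tokens obtains a lower value, which is why minimizing this quantity is equivalent to maximizing the likelihood of the training text.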


Transfer & auxiliary loss. We transfer the 
weights of the pretrained model and add one 
LSTM with a self-attention mechanism (Lin et al., 
2017; Bahdanau et al., 2015). 
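
As a rough illustration of what a self-attention layer on top of the task-specific LSTM computes, the sketch below pools a sequence of hidden states into a single vector using softmax weights. The scoring rule (a dot product with a single vector, here called `score_vector`) is an assumption made for brevity; this section does not specify the exact attention formulation, and the cited works describe richer variants.

```python
import math


def attention_pool(hidden_states, score_vector):
    """Pool a sequence of hidden states into one vector with softmax attention.

    hidden_states: list of equal-length float lists (one per time step).
    score_vector: hypothetical learned vector used to score each state.
    """
    scores = [sum(h_d * v_d for h_d, v_d in zip(h, score_vector))
              for h in hidden_states]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    normalizer = sum(exps)
    weights = [e / normalizer for e in exps]
    dim = len(hidden_states[0])
    # Weighted sum of the hidden states, one dimension at a time.
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context


states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_pool(states, [0.0, 0.0])
```

With an all-zero scoring vector every time step receives the same weight, so the pooled vector is simply the mean of the hidden states; a trained scoring vector would shift weight toward the most informative time steps.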


In order to adapt the contribution of the pretrained
model to the task at hand, we introduce an auxiliary
LM loss during training. The joint loss is the
weighted sum of the task-specific loss $L_{task}$ and
the auxiliary LM loss $L_{LM}$, where γ is a weighting
parameter to enable adaptation to the target
task but at the same time keep the useful knowledge
from the source task. Specifically:


$L = L_{task} + \gamma L_{LM}$ (2)


Figure 1: High-level overview of our proposed TL architecture.
We transfer the pretrained LM, add an extra
recurrent layer and an auxiliary LM loss.


Exponential decay of γ. An advantage of the proposed
TL method is that the contribution of the
LM can be explicitly controlled in each training 
epoch. In the first few epochs, the LM should con- 
tribute more to the joint loss of SiATL so that the 
task-specific layers adapt to the new data distribu- 
tion. After the knowledge of the pretrained LM 
is transferred to the new domain, the task-specific 
component of the loss function is more important 
and γ should become smaller. This is also crucial
due to the fact that the new, task-specific LSTM 
layer is randomly initialized. Therefore, by back- 
propagating the gradients of this layer to the pre- 
trained LM in the first few epochs, we would add 
noise to the pretrained representation. To avoid 
this issue, we choose to initially pay attention to 
the LM objective and gradually focus on the clas- 
sification task. In this paper, we use an exponential 
decay for γ over the training epochs.
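
The weighted joint loss and the decaying weight can be sketched as follows. The exact form of the decay is not given in this section, so the geometric interpolation below is an assumption, as are the function names `gamma_schedule` and `joint_loss`; the default endpoints 0.2 and 0.1 are the values reported in the results discussion.

```python
def gamma_schedule(epoch, num_epochs, gamma_start=0.2, gamma_end=0.1):
    """Exponentially decay gamma from gamma_start (epoch 0) to gamma_end (last epoch).

    Assumed form: gamma_start * (gamma_end / gamma_start) ** progress,
    where progress runs from 0 to 1 across the training epochs.
    """
    if num_epochs < 2:
        return gamma_start
    progress = epoch / (num_epochs - 1)
    return gamma_start * (gamma_end / gamma_start) ** progress


def joint_loss(task_loss, lm_loss, gamma):
    """Task loss plus the gamma-weighted auxiliary LM loss."""
    return task_loss + gamma * lm_loss


# Weight of the LM loss for each of five epochs.
gammas = [gamma_schedule(epoch, 5) for epoch in range(5)]
```

Early epochs therefore give the LM objective its largest weight, and the classification loss dominates progressively as training continues.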


Sequential Unfreezing. Instead of fine-tuning all
the layers simultaneously, we propose unfreezing
them sequentially, following Howard and Ruder
(2018) and Chronopoulou et al. (2018). We first fine-tune
only the extra, randomly initialized LSTM
and the output layer for n − 1 epochs. At the n-th
epoch, we unfreeze the pretrained hidden layers.
We let the model fine-tune until epoch k − 1. Finally,
at epoch k, we also unfreeze the embedding
layer and let the network train until convergence.
The values of n and k are obtained through grid 
search. We find the sequential unfreezing scheme 
important, as it minimizes the risk of overfitting to 
small datasets. 
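
The unfreezing schedule above amounts to a simple rule that maps an epoch to the set of trainable layer groups. The sketch below assumes epochs are counted from 1; the group names and the function `trainable_groups` are hypothetical labels for illustration.

```python
def trainable_groups(epoch, n, k):
    """Return the layer groups that are trainable at a given epoch (1-indexed).

    Epochs 1 .. n-1: only the new LSTM and the classifier are trained.
    Epochs n .. k-1: the pretrained hidden layers are unfrozen as well.
    Epochs k onward: the embedding layer is unfrozen too.
    """
    if not 1 <= n <= k:
        raise ValueError("expected 1 <= n <= k")
    groups = ["task_lstm", "classifier"]
    if epoch >= n:
        groups.append("pretrained_hidden")
    if epoch >= k:
        groups.append("embedding")
    return groups


# Example with hypothetical values n = 3 and k = 5.
schedule = {epoch: trainable_groups(epoch, n=3, k=5) for epoch in range(1, 7)}
```

In a deep learning framework, the returned names would typically be used to toggle gradient computation for the corresponding parameters at the start of each epoch.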


Optimizers. While pretraining the LM, we use 
Stochastic Gradient Descent (SGD). When we 
transfer the LM and fine-tune on each classifica- 
tion task, we use 2 different optimizers: SGD for 
the pretrained LM (embedding and hidden layer) 
with a small learning rate, in order to preserve its 
contextual information. As for the new, randomly 
initialized LSTM and classification layers, we em- 
ploy Adam (Kingma and Ba, 2015), in order to al- 
low them to train fast and adapt to the target task. 
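
To make the two-optimizer setup concrete, the following sketch applies one update to toy scalar parameters: plain SGD for the pretrained group and Adam for the newly added group. The learning rates are the ones listed in the experimental setup (0.0001 and 0.0005); the Adam hyperparameters beta1, beta2 and eps are commonly used defaults and are an assumption here, as are the parameter names and gradient values.

```python
import math


def sgd_step(value, grad, lr=0.0001):
    """Plain SGD update for one scalar parameter."""
    return value - lr * grad


def adam_step(value, grad, state, lr=0.0005, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` keeps the step count and moment estimates."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected mean
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected variance
    return value - lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy parameters: one scalar per layer, all with the same made-up gradient.
pretrained = {"embedding.w": 1.0, "hidden.w": 1.0}
new_layers = {"task_lstm.w": 1.0, "classifier.w": 1.0}
adam_states = {name: {"t": 0, "m": 0.0, "v": 0.0} for name in new_layers}
grads = {name: 2.0 for name in list(pretrained) + list(new_layers)}

for name in pretrained:
    pretrained[name] = sgd_step(pretrained[name], grads[name])
for name in new_layers:
    new_layers[name] = adam_step(new_layers[name], grads[name], adam_states[name])
```

With the same gradient, the SGD step scales with the gradient magnitude and the small learning rate, whereas the first Adam step has a magnitude close to its learning rate regardless of gradient scale, which is one reason the new layers can adapt quickly.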


Dataset    Domain         # classes  # examples
Irony18    Tweets         4          4618
Sent17     Tweets         3          61854
SCv2       Debate Forums  2          3260
SCv1       Debate Forums  2          1995
PsychExp   Experiences    7          7480


Table 1: Datasets used for the downstream tasks. 


4 Experiments and Results 


4.1 Datasets 


To pretrain the language model, we collect a 
dataset of 20 million English Twitter messages, 
including approximately 2M unique tokens. We 
use the 70K most frequent tokens as vocabulary.
We evaluate our model on five datasets:
Sent17 for sentiment analysis (Rosenthal et al.,
2017), PsychExp for emotion recognition (Wallbott
and Scherer, 1986), Irony18 for irony detection
(Van Hee et al., 2018), SCv1 and SCv2 for
sarcasm detection (Oraby et al., 2016; Lukin and 
Walker, 2013). More details about the datasets can 
be found in Table 1. 
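
The vocabulary truncation mentioned above (keeping only the most frequent tokens) can be sketched as below. The `<unk>` placeholder for out-of-vocabulary words and the helper names `build_vocab` and `encode` are assumptions for illustration; the toy corpus is made up, and whether special tokens count toward the 70K limit is not stated in this section.

```python
from collections import Counter


def build_vocab(tokenized_texts, max_size=70000, unk_token="<unk>"):
    """Map the max_size most frequent tokens to integer ids; id 0 is reserved for unknowns."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    vocab = {unk_token: 0}
    for tok, _ in counts.most_common(max_size):
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, unk_token="<unk>"):
    """Replace each token with its id, falling back to the unknown id."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]


# Toy corpus; max_size=2 keeps only the two most frequent tokens.
corpus = [["so", "happy", "today"], ["so", "tired"], ["so", "happy"]]
vocab = build_vocab(corpus, max_size=2)
ids = encode(["so", "happy", "tired"], vocab)
```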


4.2 Experimental Setup 


To preprocess the tweets, we use Ekphra- 
sis (Baziotis et al., 2017). For the generic datasets, 
we use NLTK (Loper and Bird, 2002). For the 
NBoW baseline, we use word2vec (Mikolov et al., 
2013) 300-dimensional embeddings as features. 


Model                     Irony18      Sent17       SCv2         SCv1         PsychExp
BoW                       43.7         61.0         65.1         60.9         25.8
NBoW                      45.2         63.0         61.1         51.9         20.3
P-LM                      42.7 ± 0.6   61.2 ± 0.7   69.4 ± 0.4   48.5 ± 1.5   38.3 ± 0.3
P-LM + su                 41.8 ± 1.2   62.1 ± 0.8   69.9 ± 1.0   48.4 ± 1.7   38.7 ± 1.0
P-LM + aux                45.5 ± 0.9   65.1 ± 0.6   72.6 ± 0.7   55.8 ± 1.0   40.9 ± 0.5
SiATL (P-LM + aux + su)   47.0 ± 1.1   66.5 ± 0.2   75.0 ± 0.7   56.8 ± 2.0   45.8 ± 1.6
ULMFiT (Wiki-103)         23.6 ± 1.6   60.5 ± 0.5   68.7 ± 0.6   56.6 ± 0.5   21.8 ± 0.3
ULMFiT (Twitter)          41.6 ± 0.7   65.6 ± 0.4   67.2 ± 0.9   44.0 ± 0.7   40.2 ± 1.1
State of the art          53.6         68.5         76.0         69.0         57.0
(Baziotis et al., 2018) | (Cliche, 2017) | (Ilic et al., 2018) | (Felbo et al., 2017)


Table 2: Ablation study on various downstream datasets. Average over five runs with standard deviation. BoW 
stands for Bag of Words, NBoW for Neural Bag of Words. P-LM stands for a classifier initialized with our 
pretrained LM, su for sequential unfreezing and aux for the auxiliary LM loss. In all cases, F1 is employed.


For the neural models, we use an LM with an em- 
bedding size of 400, 2 hidden layers, 1000 neurons 
per layer, embedding dropout 0.1, hidden dropout 
0.3 and batch size 32. We add Gaussian noise of 
size 0.01 to the embedding layer. A clip norm of 
5 is applied, as an extra safety measure against ex- 
ploding gradients. For each text classification neu- 
ral network, we add on top of the transferred LM 
an LSTM layer of size 100 with self-attention and 
a softmax classification layer. In the pretraining 
step, SGD with a learning rate of 0.0001 is em- 
ployed. In the transferred model, SGD with the 
same learning rate is used for the pretrained layers. 
However, we use Adam (Kingma and Ba, 2015) 
with a learning rate of 0.0005 for the newly added 
LSTM and classification layers. For developing 
our models, we use PyTorch (Paszke et al., 2017) 
and Scikit-learn (Pedregosa et al., 2011). 
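
Two of the regularization details listed above, gradient clipping with a norm of 5 and Gaussian noise on the embeddings, can be illustrated in plain Python. This is a sketch under assumptions: the clipping is taken to act on the global L2 norm of the gradients, the noise "size" of 0.01 is interpreted as a standard deviation, and the helper names are hypothetical.

```python
import math
import random


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale gradients so that their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


def add_gaussian_noise(embedding, std=0.01, rng=random):
    """Add zero-mean Gaussian noise to each embedding dimension (training only)."""
    return [x + rng.gauss(0.0, std) for x in embedding]


# A gradient vector of norm 10 is scaled down to norm 5.
clipped, norm_before = clip_by_global_norm([6.0, 8.0])

# Seeded generator so the noisy embedding is reproducible.
noisy = add_gaussian_noise([0.1, -0.2, 0.3], rng=random.Random(0))
```

Gradients whose norm is already below the threshold pass through unchanged, so clipping only intervenes on the occasional exploding update.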


5 Results & Discussion 


Baselines and Comparison. Table 2 summarizes 
our results. The top two rows detail the baseline 
performance of the BoW and NBoW models. We 
observe that when enough data is available (e.g. 
Sent17), baselines provide decent results. Next, 
the results for the generic classifier initialized from 
a pretrained LM (P-LM) are shown with and with- 
out sequential unfreezing, followed by the results 
of the proposed model SiATL. SiATL is also di- 
rectly compared with its close relative ULMFiT 
(trained on Wiki-103 or Twitter) and the state of
the art for each task; ULMFiT also fine-tunes a
LM for classification tasks. The proposed SiATL 
method consistently outperforms the baselines, the 
P-LM method and ULMFiT in all datasets. Even
though we do not perform any elaborate learning
rate scheduling and we limit ourselves to pretraining
on Twitter, we obtain higher results on two
Twitter datasets and three generic ones.


Auxiliary LM objective. The effect of the auxil- 
iary objective is highlighted in very small datasets, 
such as SCv1, where it results in an impressive
boost in performance (7%). We hypothesize that 
when the classifier is simply initialized with the 
pretrained LM, it overfits quickly, as the target vo- 
cabulary is very limited. The auxiliary LM loss, 
however, permits refined adjustments to the model 
and fine-grained adaptation to the target task. 


Exponential decay of γ. For the optimal γ interval,
we empirically find that exponentially decaying
γ from 0.2 to 0.1 over the number of training
epochs provides the best results for our classification
tasks. A heatmap of γ is depicted in Figure 3.
We observe that small values of γ should be employed,
in order to scale the LM loss to the same
order of magnitude as the classification loss over
the training period. Nevertheless, the use of exponential
decay instead of linear decay does not
provide a significant improvement, as our model
is not sensitive to the way the hyperparameter
γ is decayed.


Sequential Unfreezing. Results show that se- 
quential unfreezing is crucial to the proposed 
method, as it allows the pretrained LM to adapt 
to the target word distribution. The performance 
improvement is more pronounced when there is 
a mismatch between the LM and task domains, 
i.e., the non-Twitter domain tasks. Specifically 
for the PsychExp and SCv2 datasets, sequential
unfreezing yields a significant improvement in F1,
supporting our intuition.

Number of training examples. Transfer learning
is particularly useful when limited training data
are available. We notice that for our largest dataset,
Sent17, SiATL outperforms ULMFiT only by a
small margin when trained on all the training examples
available (see Table 2), while for the small
SCv2 dataset, SiATL outperforms ULMFiT by a
large margin and ranks very close to the state-of-the-art
model (Ilic et al., 2018). Moreover, the
performance of SiATL vs ULMFiT as a function
of the training dataset size is shown in Figure 2.
Note that the proposed model achieves competitive
results with fewer than 1000 training examples for
the Irony18, SCv2, SCv1 and PsychExp datasets,
demonstrating the robustness of SiATL even when
trained on a handful of training examples.

Figure 2: Results of SiATL, our proposed approach
(continuous lines) and ULMFiT (dashed lines) for different
datasets (indicated by different markers) as a
function of the number of training examples.

Catastrophic forgetting. We observe that SiATL 
indeed provides a way of mitigating catastrophic 
forgetting. Empirical results that are shown in Ta- 
ble 2 indicate that by only adding the auxiliary lan- 
guage modeling objective, we obtain better results 
on all downstream tasks. Specifically, a compari- 
son of the P-LM + aux model and the P-LM model 
shows that the performance of SiATL on classifi- 
cation tasks is improved by the auxiliary objective. 
We hypothesize that the language model objective 
acts as a regularizer that prevents the loss of the 
most generalizable features. 


6 Conclusions and Future Work 


We introduce SiATL, a simple and efficient transfer
learning method for text classification tasks.
Our approach is based on pretraining a LM and
transferring its weights to a classifier with a task-specific
layer. The model is trained using a task-specific
loss function combined with an auxiliary LM loss.
SiATL avoids catastrophic forgetting of the language
distribution learned by the pretrained LM.
Experiments on various text classification tasks
yield competitive results, demonstrating the efficacy
of our approach. Furthermore, our method
outperforms more sophisticated transfer learning
approaches, such as ULMFiT, in all tasks.

Figure 3: Heatmap of the effect of γ on the F-score, evaluated
on SCv2. The horizontal axis depicts the initial
value of γ and the vertical axis the final value of γ.

In future work, we plan to move from Twitter to 
more generic domains and evaluate our approach
on more tasks. Additionally, we aim to explore
ways for scaling our approach to larger vocabu- 
lary sizes (Kumar and Tsvetkov, 2019) and for 
better handling of out-of-vocabulary words (OOV) 
(Mielke and Eisner, 2018; Sennrich et al., 2015) in 
order to be applicable to diverse datasets. 

Finally, we want to explore approaches for im- 
proving the adaptive layer unfreezing process and 
the contribution of the language model objective 
(value of γ) to the target task.